Chapter 2 Bias assessment

The maps demonstrate the geographic gradients of obesity rate in the state of Utah. The left panel is estimates at the census tract level, while the right panel is estimates at the county level.

UT, Diabetes maps

tmap_arrange(UT_Diabetes_tract, UT_Diabetes_county, ncol=2)

NY, Obesity maps

tmap_arrange(NY_Obesity_tract, NY_Obesity_county, ncol=2)

NY, Diabetes maps

tmap_arrange(NY_Diabetes_tract, NY_Diabetes_county, ncol=2)

The following table presents the census tract level rates of obesity and Diabetes in the state of Utah.

2.1 Bias due to omitted confounders

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_2 + \dots + \epsilon_i; \;\; for \;\; i=1, \dots, n\]

where the errors \(\epsilon_i \sim N(0, \sigma^2)\) with independent and identically distributed (i.i.d.)

Let’s assume the following association is true (i.e., gold standard) without any selection bias, measurement bias, and other unmeasured confoundings.

N <- 100000
C <- rnorm(N)
X <- .5 * C + rnorm(N)
Y <- .3 * C + .4 * X + rnorm(N)

2.1.1 Gold standard

With the correct model specification (i.e., \(C\) as a confounder), we get an unbiased estimate of \(X\) on \(Y\).

# Gold standard
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0002838  0.0031611   -0.09    0.928    
## X            0.3946515  0.0031562  125.04   <2e-16 ***
## C            0.3000512  0.0035321   84.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9992331)
## 
##     Null deviance: 140183  on 99999  degrees of freedom
## Residual deviance:  99920  on 99997  degrees of freedom
## AIC: 283716
## 
## Number of Fisher Scoring iterations: 2

2.1.2 Misspecified model: a confounder, \(C\), was omitted from the model

By omitting \(C\), the estimate of \(X\) was biased either “away from” or “towards to” the null

# C was omitted
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0009285  0.0032732  -0.284    0.777    
## X            0.5140045  0.0029264 175.645   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.071334)
## 
##     Null deviance: 140183  on 99999  degrees of freedom
## Residual deviance: 107131  on 99998  degrees of freedom
## AIC: 290682
## 
## Number of Fisher Scoring iterations: 2

2.1.3 Bias “away from” or “towards to” the null?

N <- 100000
C <- rnorm(N)
X <- -.5 * C + rnorm(N)
Y <- -.3 * C + .4 * X + rnorm(N)
# C was omitted
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006061   0.003159  -1.919    0.055 .  
## X            0.395909   0.003151 125.638   <2e-16 ***
## C           -0.303340   0.003536 -85.791   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9977296)
## 
##     Null deviance: 140646  on 99999  degrees of freedom
## Residual deviance:  99770  on 99997  degrees of freedom
## AIC: 283565
## 
## Number of Fisher Scoring iterations: 2
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.005774   0.003273  -1.764   0.0777 .  
## X            0.516762   0.002921 176.933   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.071154)
## 
##     Null deviance: 140646  on 99999  degrees of freedom
## Residual deviance: 107113  on 99998  degrees of freedom
## AIC: 290665
## 
## Number of Fisher Scoring iterations: 2

2.1.4 A \(C\) is not a confounder on \(X\) and \(Y\)

N <- 100000
C <- rnorm(N)
X <- rnorm(N)
Y <- .4 * X + rnorm(N)

2.1.5 Correct model specification: Without \(C\)

glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.002183   0.003155  -0.692    0.489    
## X            0.399391   0.003137 127.306   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9951843)
## 
##     Null deviance: 115645  on 99999  degrees of freedom
## Residual deviance:  99516  on 99998  degrees of freedom
## AIC: 283309
## 
## Number of Fisher Scoring iterations: 2

2.1.6 Misspecified model with \(C\)

glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.002165   0.003155  -0.686    0.493    
## X            0.399398   0.003137 127.308   <2e-16 ***
## C           -0.003133   0.003156  -0.993    0.321    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9951844)
## 
##     Null deviance: 115645  on 99999  degrees of freedom
## Residual deviance:  99515  on 99997  degrees of freedom
## AIC: 283310
## 
## Number of Fisher Scoring iterations: 2

2.1.7 A \(C\) is a colloder on \(X\) and \(Y\)

N <- 100000
X <- rnorm(N)
Y <- .7 * X + rnorm(N)
C <- 1.2 * X + .6 * Y + rnorm(N)

2.1.8 Correct model specification: Without \(C\)

glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.002391   0.003162   0.756     0.45    
## X           0.699815   0.003164 221.198   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9996449)
## 
##     Null deviance: 148874  on 99999  degrees of freedom
## Residual deviance:  99962  on 99998  degrees of freedom
## AIC: 283756
## 
## Number of Fisher Scoring iterations: 2

2.1.9 Misspecified model with \(C\)

This is one of examples of selection bias. For example, let’s say, \(X\) is Education, \(Y\) is income, and \(C\) is social welfare program. People at lower education (i.e., high risk group in terms of exposure) and lower income (i.e., higher risk group in terms of outcome) are more likely to register social welfare program. If survey was conducted based on the registered social welfare program, the “estimated” association from this “disproportionally selected” respondents are likely biased.

glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.003165   0.002712   1.167   0.2432    
## X           -0.015171   0.004648  -3.264   0.0011 ** 
## C            0.441410   0.002329 189.493   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.7355337)
## 
##     Null deviance: 148874  on 99999  degrees of freedom
## Residual deviance:  73551  on 99997  degrees of freedom
## AIC: 253077
## 
## Number of Fisher Scoring iterations: 2

2.2 Overadjustment bias

Please note that this is not a comprehensive example; only reflect one aspect of potential overadjustement bias.

Let’s assume a model with \(M\) as a mediator.

N <- 100000
X <- rnorm(N)
M <- .5 * X + rnorm(N)
Y <- .3 * X + .4 * M + rnorm(N)

2.3 Total effect

glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003217   0.003408  -0.944    0.345    
## X            0.498535   0.003399 146.662   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.161171)
## 
##     Null deviance: 141091  on 99999  degrees of freedom
## Residual deviance: 116115  on 99998  degrees of freedom
## AIC: 298735
## 
## Number of Fisher Scoring iterations: 2

2.4 Overadjustment

glm.unbiased <- glm(Y~X + M, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + M, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003242   0.003165  -1.024    0.306    
## X            0.300231   0.003528  85.101   <2e-16 ***
## M            0.398702   0.003163 126.039   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.002003)
## 
##     Null deviance: 141091  on 99999  degrees of freedom
## Residual deviance: 100197  on 99997  degrees of freedom
## AIC: 283993
## 
## Number of Fisher Scoring iterations: 2

2.5 Logistic models

2.5.1 Sex as a Confounder, \(C\)

MYY <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 5
                  )

MYN <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 8
                  )

MNY <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 45
                  )

MNN <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 72
                  )


FYY <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 25
                  )

FYN <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 10
                  )

FNY <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 25
                  )

FNN <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 10
                  )

Ex_confounder <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)

Convert Freq table to raw data

library(tidyr)
raw_confounder <- Ex_confounder %>% 
  uncount(freq)
glm.unbiased <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder)
summary(glm.unbiased)
## 
## Call:
## glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_confounder)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.1582     0.1627  -0.972   0.3309  
## SmokingYes    0.6690     0.3397   1.970   0.0489 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 277.26  on 199  degrees of freedom
## Residual deviance: 273.28  on 198  degrees of freedom
## AIC: 277.28
## 
## Number of Fisher Scoring iterations: 4
  • Full model:
glm_logit <- glm(Cancer ~ Smoking + Sex , family=binomial(link = "logit"), data=raw_confounder)
glm_logit
## 
## Call:  glm(formula = Cancer ~ Smoking + Sex, family = binomial(link = "logit"), 
##     data = raw_confounder)
## 
## Coefficients:
## (Intercept)   SmokingYes      SexMale  
##   9.163e-01    4.266e-15   -1.386e+00  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:       277.3 
## Residual Deviance: 257   AIC: 263
  • Stratified models
## For males
raw_confounder_M <- raw_confounder[ which(raw_confounder$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_M)
glm_logit_m
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_confounder_M)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##  -4.700e-01    6.672e-16  
## 
## Degrees of Freedom: 129 Total (i.e. Null);  128 Residual
## Null Deviance:       173.2 
## Residual Deviance: 173.2     AIC: 177.2
# For females
raw_confounder_F <- raw_confounder[ which(raw_confounder$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_F)
glm_logit_f
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_confounder_F)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##   9.163e-01    9.400e-16  
## 
## Degrees of Freedom: 69 Total (i.e. Null);  68 Residual
## Null Deviance:       83.76 
## Residual Deviance: 83.76     AIC: 87.76

2.5.2 Sex as a Moderator, \(M\)

MYY <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 5
                  )

MYN <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 4
                  )

MNY <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 45
                  )

MNN <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 68
                  )


FYY <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 25
                  )

FYN <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 14
                  )

FNY <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 25
                  )

FNN <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 14
                  )

Ex_moderator <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)

Convert Freq table to raw data

library(tidyr)
raw_moderator <- Ex_moderator %>% 
  uncount(freq)
  • Full model:
glm_logit <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator)
glm_logit
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_moderator)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##     -0.1582       0.6690  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  198 Residual
## Null Deviance:       277.3 
## Residual Deviance: 273.3     AIC: 277.3
  • Stratified models
## For males
raw_moderator_M <- raw_moderator[ which(raw_moderator$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_M)
glm_logit_m
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_moderator_M)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##     -0.4128       0.6360  
## 
## Degrees of Freedom: 121 Total (i.e. Null);  120 Residual
## Null Deviance:       165.1 
## Residual Deviance: 164.3     AIC: 168.3
# For females
raw_moderator_F <- raw_moderator[ which(raw_moderator$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_F)
glm_logit_f
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_moderator_F)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##   5.798e-01   -2.621e-16  
## 
## Degrees of Freedom: 77 Total (i.e. Null);  76 Residual
## Null Deviance:       101.8 
## Residual Deviance: 101.8     AIC: 105.8